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1  Introduction 


Many  problems  in  computational  physics  are  characterized  as  multiscale,  whether  spatial,  temporal, 
or  involving  other  dimensions.  Efficient  integration  of  the  dynamics  over  the  large  scales  of  interest 
is  therefore  an  issue  of  key  importance  for  many  applications.  Numerical  schemes  developed  for 
this  purpose  must  also  satisfy  accuracy  and  stability  constraints,  and  a  variety  of  approaches  have 
been  developed  to  tackle  these  issues  [1,2].  Some  of  these  make  use  of  the  particular  structure 
of  the  dynamical  equations,  and  have  a  limited  range  of  applicability.  Others  are  more  general, 
but  based  on  general  principles  and  guidelines  [1],  and  requiring  further  development  for  specific 
applications.  Here,  we  examine  a  relatively  recent  approach,  the  Parareal  method,  [3],  which  we  refer 
to  as  being  within  a  more  general  class  of  time  parallel  (TP)  methods,  and  investigate  variations 
of  the  procedure.  The  Parareal  method  is  an  iterative  error  correction  between  two  different  scales, 
coarse  and  fine,  and  is  closely  related  to  the  path  optimization  problem  and  the  multi-grid  scheme 
applied  to  the  time  domain.  The  multiscale  perspective  employed  by  the  Parareal  method,  while  not 
reducing  the  aggregate  computations  required,  as  many  other  approaches,  does  allow  for  greater 
parallelization,  and  thus  faster  real  time  computing  with  the  appropriate  infrastructure.  In  this 
paper,  we  present  the  Parareal  method  within  an  optimization  framework  and  show  alternatives 
that  can  lead  to  accelerated  convergence  over  a  wider  range  of  scales.  We  first  review  the  standard 
Parareal  algorithm,  and  discuss  some  of  the  modified  approaches  that  have  been  developed.  Various 
numerical  tests  are  conducted  next,  and  the  results  are  compared.  We  offer  insights  on  future 
directions  in  the  conclusion  section. 

2  Time  Parallel  (TP)  Review 

2.1  Motivation  for  the  TP  Approach 

The  fundamental  problem  of  interest  can  be  described 

^  =  fix),  with 

where  x(t)  :  R  — »•  R^and  f(x)  :  UD  — »■  RD.  Equation 
equation  (ODE)  or  a  spatially  discretized,  time-dependent  partial  differential  equation  (PDE).  The 
numerical  procedure  for  solving  for  the  solution  x  involves  discretizing  the  time  derivative  of  x  and 
evaluating  /  at  discrete  locations  [4].  We  can  express  this  in  a  compact  form: 

xt+ 1  =  F{xt)  (2) 

where  t  €  [0,  T]  ,  T  :  RD  — >•  RD  is  the  discrete  propagator,  and  Xt  is  a  numerical  approximation 
to  x(t).  Several  examples  of  common  propagators  are  given  in  the  examples  of  Section  4  below. 
Numerically  solving  for  a  differential  equation  (DE)  with  forward  propagation  results  in  a  perfectly 
sequential  numerical  procedure;  this  is  the  most  natural  way  to  model  time-dependent  behavior. 
However,  sequentially  solving  DEs  is  far  from  ideal  from  a  modern  computational  perspective,  since 
the  approach  does  not  take  advantage  of  the  potential  parallelization  efficiency  of  GPUs  (Graph¬ 
ics  Processing  Units)  and  multi-core  CPUs  (Central  Processing  Units).  While  a  solver  involving 
parallel  computations  in  the  time  domain  may  be  much  more  complex  and  require  a  larger  num¬ 
ber  of  operations,  the  parallel  method  could  potentially  result  in  greatly  reduced  computational 
time  (“wall-clock”),  as  a  result  of  the  simultaneous  execution  across  may  processors.  To  put  it  an¬ 
other  way,  a  time  parallel  (TP)  approach  consists  of  creating  unnatural  and  less  efficient  schemes 
that,  nonetheless,  are  faster  because  they  better  utilize  existing  computational  technology.  A  basic 
overview  of  the  challenges  involved  with  such  an  approach  is  given  in  [5]. 


as: 

x(0)=x0  (1) 

(1)  describes  either  an  ordinary  differential 
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We  will  focus  exclusively  here  on  TP  methods,  as  opposed  to  other  parallelization  algorithms  that 
can  be  applied  within  a  single  time  step.  There  are  few  good  alternatives  for  solving  ODEs  beyond 
TP  approaches  as  the  number  of  computations  at  a  single  time  step  is  often  relatively  small.  For 
PDEs,  other  appropriate  parallelization  approaches  exist  such  as  the  decomposition  of  the  spatial 
domain  or  the  parallelization  of  a  linear  solver  [6]within  a  single  time  step.  Time  parallel  methods 
are  not  always  mutually  exclusive  to  these,  if  there  is  sufficient  computing  capacity. 

Although  the  important  characteristics  of  a  time  parallel  implementation  can  be  somewhat  problem- 
specific,  there  are  a  few,  generally  applicable  desirable  properties  required,  which  can  give  a  general 
perspective. 

1.  Accuracy:  The  solution  from  the  parallel  algorithm  should  be  nearly  identical  to  that  of  a 
well-established  serial  approach.  Improved  accuracy  is  given  as  a  motivation  for  developing 
some  of  the  TP  schemes  presented  in  Section  3.2  and  some  general  accuracy  estimates  are 
given  in  Section  3.4.  A  discussion  of  the  accuracy  results  for  some  specific  problems  is  given 
in  Section  4.3. 

2.  Stability:  The  error  from  a  TP  method  should  not  become  infinitely  large  when  the  serial 
propagators  are  stable  and  accurate.  We  discuss  how  instability  may  arise  in  some  cases  and 
how  it  can  be  addressed  in  Section  3.3. 

3.  Parallelization  Efficiency:  The  number  of  required  serial  computations  should  be  as  small  as 
possible.  Additionally,  data  exchange  should  be  minimized,  i.e.  a  processor  should  be  given 
as  large  a  block  of  computational  work  as  possible  and  should  have  to  send  and  receive  as 
little  data  as  possible  to  other  processors.  Some  discussion  of  this  is  presented  in  Section  4.4. 

2.2  Existing  Time  Parallel  Approaches 

The  Parareal  method  [3]  involves  both  coarse-scale  and  fine-scale  solutions,  advanced  respectively 
with  a  coarse  propagator  C,  and  a  fine  propagator  J r,  which  are  iteratively  updated.  In  this  context, 
the  solution  is  no  longer  described  using  t  alone,  i.e.: 

xt  ->•  (3) 

where  u  is  the  iteration  number,  while  v  €  [0,  V]  serves  a  similar  purpose  to  t  at  each  iteration, 
but  a  different  symbol  is  used  to  indicate  that  it  is  not  time  as  it  is  classically  understood.  The 
starting  point  of  the  method  is  the  coarse  solution,  obtained  by  sequentially  advancing  the  coarse 
propagator  C: 

Step  0  [Serial]:  x^+\  =  C(Xy)  (4) 

After  this  initial  coarse  solution  is  found  a  two-step  iterative  procedure  is  performed.  In  the  parallel 
step  (1),  both  C  and  the  computationally  intensive  fine  propagator  T  are  run  once  in  parallel  for 
each  v.  In  the  serial  step  (2),  the  C  is  applied  sequentially. 

Step  1  [Parallel,  Iterative]:  ^v+i  =  J~(xv)  ~  C(xv)  (5) 

Step  2  [Serial,  Iterative]:  xX+\  =  ^(xv+1)  +  (6) 

The  fine  propagator  can  be  anything  more  computationally  intensive  and  accurate  than  the  coarse 
propagator,  but  the  greatest  potential  speed-up  is  achieved  when  this  difference  is  somewhat  large, 
i.e.  for  large  separation  of  scales.  In  this  scheme,  v  is  to  be  considered  a  ‘macroscale’  variable 
associated  with  a  time  step  dv,  with  C  applied  on  this  scale.  Similarly,  w  is  a  ‘microscale’  variable 
with  time  step  dw,  and  J7  is  a  compact  way  of  describing  a  multi-step  integration  procedure.  The 
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solution  x  can  be  expressed  in  terms  of  both  a  macroscale  variable  v  €  [0,  V]  and  a  microscale 
variable  w  €  [0,  W],  where  W  is  number  of  microsteps  per  macrostep. 

—  VCvi  3'v,W  =  (  ) 

The  fine  propagator,  J7,  can  be  written  in  such  a  way  to  signify  the  number  of  times  a  microscale 
propagator  is  applied,  W,  using  the  following  notation: 


F(xv)  =  J[0 ,W]{xv)  =  F\w-i,w\  °  F\w-2,w-i]  °  °  ^[o,i](®«,o) 


(8) 


The  Parareal  method  has  been  derived  via  linear  algebra  as  well  as  in  other  mathematical  contexts; 
various  modifications  to  the  basic  Parareal  method  have  also  been  proposed  [7-12],  some  of  which 
may  be  particularly  well  suited  to  certain  types  of  DEs.  In  [11],  the  authors  propose  employing  the 
compound-wavelet  method  for  adjusting  the  coarse  solution  based  on  the  coarse  and  fine  solutions 
computed  in  parallel  on  each  interval.  This  may  be  particularly  useful  when  the  fine  solution  is 
noisy.  Additionally,  the  authors  claim  that  in  some  circumstances  this  approach  may  allow  the  fine 
time  steps  to  only  be  computed  on  a  small  portion  of  the  time  domain.  In  [9],  a  method  which  works 
well  for  certain  structural  dynamics  problems  is  presented  and  [10]  modifies  the  original  Parareal 
method  to  better  handle  cases  where  large  matrices  are  involved. 

Other  types  of  TP  approaches  beyond  the  Parareal  method  exist  as  well.  The  Waveform  Relaxation 
Method  [13,  14]  solves  large  systems  of  differential  equations  by  breaking  down  the  system  into 
loosely  coupled  subsystems.  In  [5]  an  approach  to  solving  linear  ODEs  in  parallel  is  discussed  with 
a  means  to  extend  this  approach  to  a  few  limited  types  of  ODEs.  Lastly,  we  note  that  TP  approaches 
have  connections  to  iterative  root-finding  algorithms,  such  as  Newton  approximation  methods  and 
Halley’s  method;  that  is,  the  entire  DE  can  be  discretized  in  both  space  and  time,  with  appropriate 
boundary  conditions,  and  solved  for  the  unknown  variables. 


2.3  The  Functional  Optimization  Approach 

The  problem  of  finding  a  solution  to  a  ODE  can  by  reformulated  as  a  problem  of  finding  the 
minimum  of  the  functional,  $  :  RD  — >  [R,  given  by: 

=  \fo  -f&w)  dt=\f0  (9) 

This  is  an  action  integral,  and  the  function  x  that  minimizes  satisfies: 

d^p--f(x(t))  =  (p(x{t))  =  0  (10) 

At  the  most  basic  level,  the  reason  for  incorporating  this  functional  is  that  it  quantifies  how  far 
away  any  given  function  is  from  satisfying  the  ODE  of  interest.  Functional  optimization  techniques 
can  then  be  employed  to  find  variations  of  x  that  can  be  subtracted  to  reduce  the  value  of  <L.  The 
following  equations  are  written  for  a  vector-valued  x(t)  corresponding  to  an  ODE,  but  generalizes 

readily  to  the  PDE  case  with  the  replacement  with  the  replacement  of  the  discrete  dot  product: 


d(x(t)) 

dt 


d(x{t )) 
dt 


r  ( d(x(z,t)) 

AA  dt 


f(x(z,  t))^j  •  *))  -/(a;(z,  t))^  dz 

(11) 
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2.4  The  L 2  Gradient 

The  inner  product  of  the  L2  gradient  of  the  functional  and  a  small  variation  h(t)  is  given  by: 


with 


VL2<f>(» 


df_  f  d(x(t)) 
dx  x  V  dt 


This  can  be  computed  directly  from  the  functional  [15]. 


d  ( d(x(t)) 
dt  \  dt 


f(x{t)) 


(12) 


(13) 


Note  that  here  and  in  what  follows,  we  do  not  specify  boundary  conditions.  The  only  boundary 
condition  is  at  the  point  x(0)  and  it  remains  fixed.  One  approach  to  finding  the  functional  minimum 
is  to  solve  for  when  this  derivative  is  0  (the  Euler-Lagrange  equation)  [15]. 


VL2&(x)  =  0 


(14) 


However,  this  involves  solving  a  large  non-linear  system,  where  the  size  is  based  on  the  time  step 
number,  and  the  appropriate  nonlinear  solver  to  use  is  not  immediately  clear.  As  an  alternative,  a 
popular  approach  is  to  employ  gradient  descent  [15]: 

^  (15) 

du 

where  x  is  now  considered  to  be  a  function  of  two  variables:  x(u,v)  :  IR  x  [R  — »  RD.  If  implemented 
explicitly,  no  linear  system  needs  to  be  solved: 

Tu+ 1  _  TU 

V  du  V  =  VL2#«)  (16) 

where  du  is  the  artificial  time  step.  This  approach  is,  however,  impractical  for  at  least  two  reasons. 
The  first  issue  is  the  severe  size  limitation  on  du:  the  L 2  gradient  of  4>  has  a  second  derivative  in 
v,  which  yields  an  oppressive  time  step  restriction  of  the  form  du<Kdv2.  The  other  issue  is  the 
speed  of  propagation  of  information:  if  the  values  of  x  change  near  v  ~  0,  the  values  at  the  end  of 
the  time  domain,  v  ~  V,  will  not  be  affected  until  many  artificial  time  steps  have  occurred. 


2.5  The  Sobolev  Gradient 

The  Sobolev  Gradient  can  replace  the  L2  gradient  and  lead  to  a  faster  numerical  computation  of  a 
functional  minimum  [16-24].  The  theory  behind  this  approach  is  well  established  [16].  One  way  this 
has  been  explained  is  by  comparing  the  inner  products  of  the  standard  L2  and  the  most  common 
Sobolev  inner  product,  the  H 1  inner  product  [16, 17].  In  the  L2  case,  it  is  given  by: 


k(t)-h(t)dt 


(17) 


This  inner  product  is  assumed  when  computing  the  L 2  gradient  of  a  functional.  The  H 1  inner 
product  is  given  by: 

(k,h)Hl=J^  ^k(t)-h(t)  +  dt  (18) 

In  general,  the  Sobolev  inner  product  depends  only  on  the  two  functions  and  their  derivatives. 
Higher-order  derivatives  can  be  included  if  the  functional  of  interest  also  contains  higher  order 
derivatives  [17]. 
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The  H 1  gradient  is  not  computed  directly  from  a  functional.  Instead,  the  L 2  gradient  is  com¬ 
puted  and  then  modified  based  on  the  relationship  between  the  L2  and  H1  inner  products.  With 
integration  by  parts, 


•  h(t )  dt 


(19) 


Taking  the  directional  (Frechet)  derivative  in  the  different  norms  gives  [16, 17]  : 


&(x)h  =  (VL2<h(x),h)L2 


i.e. 


f 


hdt  = 


Jo 


h  dt 


If  this  holds  for  any  variation  h(t), 


2  \  -i 


VHi<S>(x)=  (/-^)  Vt;Mx) 


(20) 


(21) 


(22) 


Thus,  to  obtain  the  H1  and  the  Sobolev  gradient  in  general,  the  L 2  gradient  is  computed  and  a 
sparse  linear  system  must  be  solved  for.  The  H1  gradient  is  smoother  than  L2  and  allows  much 
larger  time  steps  to  be  taken.  This  stems  from  the  fact  the  Sobolev  inner  product  also  depends 
on  the  derivatives  of  the  input  functions.  This  approach  has  been  shown  effective  in  a  wide  range 
of  applications,  including  physical  modeling  [18],  image  segmentation  and  registration  [19-21],  and 
tetrahedral  mesh  generation  [22]. 

The  Sobolev  gradient  has  been  examined  specifically  in  the  context  of  ODEs  [23,24].  While  shown 
to  be  far  better  than  the  L 2  gradient,  a  few  major  problems  still  exist.  One  issue  is  that  it  takes 
too  many  iterations  to  converge;  the  number  of  iterations  is  large  enough  that  it  is  hard  to  argue 
that  it  is  faster  than  the  sequential  approach  under  any  reasonable  computing  system.  Additionally, 
the  method  does  not  immediately  parallelize  well.  This  is  due  to  the  tridiagonal  system  (22)  that 
must  be  solved.  While  these  systems  can  be  serially  solved  quickly  [25],  less  efficient  iterative 
schemes  must  be  employed  for  parallel  solutions  [6].  Some  of  these  problems  were  considered  in 
[24],  where  it  was  proposed  that  the  time  domain  could  be  broken  down  into  sub-intervals  and 
variable  time  step  sizes  employed.  But  many  of  the  general  strategies  presented  could  apply  to 
other  optimization  approaches  and  further  work  is  still  required  to  make  an  approach  like  this 
competitive  in  computation  time  with  the  sequential  approach. 

To  make  an  optimization  approach  like  the  Sobolev  gradient  method  viable,  we  need  orders  of 
magnitude  reductions  in  the  iteration  count  as  well  as  a  construction  that  relies  only  on  algorithms 
that  can  be  done  efficiently  and  directly  with  parallel  computing.  We  describe  such  an  approach  in 
Section  3  below. 


3  Efficient  TP  Solvers  via  Functional  Optimization 

3.1  Derivation  of  the  Time  Parallel  PDE 

While  the  Sobolev  gradient  produces  a  smoother  descent  that  the  L2  gradient,  it  still  does  not 
prevent  x  from  updating  to  configurations  that  differ  significantly  from  the  true  solution  of  the  DE. 
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Thus,  we  are  led  to  consider  an  inner  product  and  an  associated  norm  that  more  closely  resembles 


(M 


at  ax 


<23> 


with 


=  (M}nK  (x) 


(24) 


The  properties  of  an  inner  product,  including  linearity,  are  satisfied  for  a  given  h  and  k  and  arbitrary 
x.  Intuitively,  this  inner  product  resembles  one  defined  on  a  manifold,  although  it  is  for  a  function 
x  rather  than  for  a  surface.  We  refer  to  this  inner  product  as  the  DE  inner  product  and  consider 
the  variations  in  the  solution  to  be  bounded  in  the  DE  norm.  The  variations  can  be  calculated  by 
comparing  the  L 2  gradient  and  the  DE  gradient  as  is  done  for  the  case  of  the  Sobolev  gradient. 


n 


riAv 

dx  DE 


<h(x) 


dtp 

dx 


h(t)dt  =  f  Lp-^f-h(t)dt 
Jo  dx 


If  this  holds  for  arbitrary  h, 


Combining  the  gradient  descent  equation  with  this  gradient  yields: 

dp  dx 

7tiTu+Lf'  =  0 


Written  in  terms  of  /  :  — >•  ,  this  equation  is: 


d2x 

dvdu 


d[_ 

dx 


dx  dx  „ .  , 


(25) 


(26) 


(27) 


(28) 


with  x(u,v)  :  IR  x  IR  — >•  [R  .  This  equation  can  also  be  written  equivalently  using  propagators  T\ 


d  dT 

—x(u,v+dv) - — 

du  dx 


dx(u , v) 


x(u,v) 


du 


+  x(u,  v+dv)  —  F{x{u,  v))  =  0 


(29) 


which  allows  known  serial  propagators  to  be  encapsulated  in  this  TP  PDE  more  easily. 


At  first  glance,  this  may  seem  to  be  a  step  backwards,  as  we  have  potentially  replaced  an  ODE  with 
a  second  order  non-linear  PDE,  whose  solutions  are  generally  much  more  computationally  intensive. 
However,  this  PDE  is  an  artificial  one  that  is  designed  to  solve  a  computational  problem  and  can 
be  solved  significantly  faster  than  an  arbitrary  PDE  of  this  order.  By  design,  it  is  well  conditioned 
with  step  size  du  =  1  and  convergences  rapidly  to  the  solution  of  the  serial  DE  in  only  a  few  steps  in 
the  u  direction.  Furthermore,  unlike  the  direct  serial  solution  to  the  DE,  solving  for  this  particular 
PDE  can  involve  parallel  processes  and  can  be  broken  down  into  two  steps:  (1)  “microscale”  and 
(2)  “macroscale”.  Ten  numerical  schemes  that  allow  for  parallelization  are  presented  in  Section 
3.2.  The  first  few  schemes  simply  alternate  updates  in  the  v  and  u  variables  (where  v  updates  use 
serial  DE  propagators  on  independent  intervals  and  u  updates  combine  the  independtly  computed 
information)  while  the  rest  have  a  carefully  chosen  type  of  u,  v  stencil  that  still  allows  for  a  TP 
implement  at  ion . 
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3.2  Numerical  Schemes 

Scheme  1  (dJ7):  Direct  discretization 

A  first  order  explicit  discretization  in  the  u  variable  yields: 


mv  = 


xu+i  _  xu 


du 


v_  —  ™«+i  _  ^ 

•Xj  n  <  »Xj  n  ' 


(30) 


Tflv-\-l 


dJ_ 

dx 


m , 


+  <+1-.f«)  =  o 


(31) 


The  propagator  T  can  be  chosen  based  on  an  appropriate  choice  for  the  serial  solver,  with  ^ 
the  derivative  of  this  propagator  with  respect  to  x.  While  T  is  computed  over  many  time  steps, 
only  a  single  vector  of  size  D  needs  to  be  exported  to  the  serial  solver  along  with  matrix 
For  the  examples  presented,  we  take  T  =  .7-jo tw]  1°  be  the  repeated  application  of  a  few  common 
propagators,  with  a  microscale  time  step  size  dw,  as  given  in  Equation  (8).  Differentiating  this 
equation  yields 


dF[o,w] 

 dF[w-i,w] 

dF\w-i,w- 2] 

dF[l,2] 

dF[  o,i] 

dx 

dx 

F[o,w-i\(xv)  d X 

J~[0,W—2]  ixv) 

dx 

■^[0,1]  (*S)  dx 

(32) 


which  is  the  product  of  W  matrices.  The  resulting  algorithm  for  this  discretization  is  given  in 
Algorithm  1. 


Algorithm  1  :  TP  Implementation  of  the  dJ7  scheme 


Step  0  [Macroscale,  Serial]:  Solve  for  the  coarse  solution  and  set  to 
for  u  =  0  to  u  <  max  iterations  do 
Step  1:  [Microscale,  Parallel] 

1.1  Compute  (x%)  for  all  w  by  serial  propagation 

1.2  Set  F(x™)  =jc-[(W](x“) 


1.3  Compute 


^"[o  ,w] 


from  Eq. 


(32) 


Step  2:  [Macroscale,  Serial] 

2.1  mv+ 1  =  rnv  -  rc“+1  +  ^(sjf) 

2.2  x“+1  =  +  mv 


enddo 


Scheme  2  (dC)  :  Direct  Discretization  with  Propagator  Derivative  Approximation 

For  this  scheme,  the  fine  propagator  derivative  in  Algorithm  1  is  replaced  with  the  derivative  of  a 
coarse  operator. 


mv  (33) 

This  substitution  will  reduce  the  computational  cost  of  the  parallel  step.  Additionally,  for  larger 
systems,  dC/dx  could  reduce  the  total  computation  required  in  the  matrix  vector  multiplication  of 
the  macroscale  step. 


dT 

dx 


mnv 


dC 

dx 
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Scheme  3  (Parareal):  Approximation  with  Coarse  Propagators 

With  the  following  substitution: 


mv  — >  C(x™  +17^)  -  C(x“),  (34) 

we  have  the  Parareal  method  described  in  Section  2. 

Scheme  4  (PA):  Propagator  Approximation 

The  function  r  :  RD  X  IRD  ->  is  given  by: 


dF 

dx 


1  ^V,W+l)  —  ^V,W+1  "^""l  ip^v,w) 


(35) 


Unlike  the  propagator  F,  r  does  not  actually  propagate  a  solution  forward  in  time;  rather,  it  more 
directly  measures  the  error  in  x.  Inserting  into  the  TP  PDE,  Equation  (28),  yields: 


’M'VjW+l  4“ 


dr 

dx 


mv 


+  r(xv,w,xv,w+ 1)  =  0 


(36) 


We  define  the  function  that  directly  updates  the  values  of  mv  as  nv^VJ)W+lj 


R°  -»•  RD: 


(37) 


where  n  is  written  in  such  a  way  as  to  succinctly  represent  a  composition  of  functions  in  a  multiscale 
setting: 

'ft‘v,[vu,w+'2]  =  fly ,[w+l,w+2]  °  (38) 

The  derivative  of  n  is  given  by: 


d. 


dn 


v,[w,w-\-l] 


v,[w,w-\- 1] 

dmVAn 


dr 

dx 


(39) 


d  can  be  computed  over  an  entire  macroscale  block  as  a  matrix  product 


,  _  ,  dn[VtV+ 1] 

dv,[0,W]  —  d[v,v+l]  ~ 


dr 

dr 

dr 

dx 

Tu  dx 

,ru  dx 

^v,w- 1 

(40) 


Unlike  Equation  (32),  this  computation  does  not  depend  on  a  serial  propagator.  We  can  advance 
m  on  the  macroscale  by  the  following  equation: 

iHv+i  n[0,„+i](ra0)  ^[0,11+1]  (9 )  <^[ii,d+i]^'[o,j;](9)  4-  ^[v,v+i] (0)  (41) 

d,[v,v+i]'^'v  4~  ^[u,t)+i] (9) 


The  algorithm  for  this  propagator  approximation  (PA)  scheme  is  given  in  Algorithm  2. 

Scheme  5  (mPl):  First  Multi-Propagator  Approximation 

In  this  scheme,  two  TP  updates  are  combined,  one  which  approximates  a  fine  propagator  derivative 
using  a  coarse  propagator  derivate,  and  a  second  TP  update  that  approximates  the  DE  using  this 
fine  propagator  derivative  approximation.  Both  of  these  TP  updates,  along  with  additional  DE 
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Algorithm  2  :  TP  Implementation  of  the  PA  scheme 
Step  0:  [Macroscale,  Serial]  Solve  for  the  coarse  solution  and  set  to  x° 
for  u  =  0  to  u  <  max  iterations  do 
Step  1:  [Microscale,  Parallel] 

1.1  Compute  x°  ,u,for  all  w  by  interpolating  the  coarse  solution 

1.2  Compute  r(x%w,x%w+1)  and  %\  for  all  w 

1.3  Compute  7i[V)t>+i](0)  by  serial  linear  operations  as  in  Eq.  (36) 

1.4  Compute  d\V)V+ u  as  a  matrix  product  from  Eq.  (40) 

Step  2:  [Macroscale,  Serial] 

2.1  17 lv+1  ^[u,i;+l] TYlv  T  f^[v,v+l]  (0) 

2.2  x“+1  =  +  mv 

enddo 


coarse  propagator  computations,  are  computed  in  such  a  way  as  to  still  allow  for  an  efficient  parallel 
implement  at  ion . 


We  first  consider  the  matrix  equation: 

£-£*  =  0 

at  ax 


(42) 


where  |  :  Rx  ^  —X  and  X(t)  :  R  —X  -M .  The  solution  to  Equation  (42)  is  given  by  the  serial 
operation: 


Xt+l  =  dF\XtXt 


(43) 


An  approximation  to  this  solution  can  be  found  by  the  same  means  as  an  approximation  was  found 
to  the  original  equation  of  interest  given  by  Equation  (2).  An  initial  approximation  given  by  a 
coarse  solution 

-^'■y,[0,iu]  dCv^  0,u>]  (44) 

can  be  updates  on  the  microscale  along  the  lines  of  Equations  (33)  and  (36).  Specifically,  the  vector 
valued  function  x  is  replaced  with  the  matrix  valued  function  X  and  dr/dx  is  replaced  by  dC.  This 
equation  is  given  by: 

,\w ,W+1\  <^^'v,[w,w+l]^^-v,[0,w]  [0,10+1]  4~  dJ~vt[vj,W-\-l]X-v,[0,w]  (45) 


This  equation  can  be  solved  for  by  serial  linear  operations  and  it  is  also  equal  to: 

W—l 

X^^-v,[0,W]  'y  '  d^v,[i-\-l,W]  1]  dCv, [j,i+l])  y]  (46) 

i= 0 


The  sum  in  Equation  (46)  is  an  o(W)  operation  and  can  potentially  be  computed  without  significant 
cost  beyond  running  the  fine  propagator  on  the  interval  of  size  dv  in  some  cases.  However,  here,  the 
sum  is  simply  approximated.  Using  a  discrete  Simpson’s  method  at  the  values  at  i  =  0,  i  =  b\/2, 
and  i  =  b\  where  b\  =  W  —  1  and  is  divisible  by  2,  this  approximation  is  given  by: 

X[^-v\0 ,W]  ~  kod^v,[l,bi+i\  {d^v,[0,l]  ^'u,[0,l]) 

+  kj>1/2dCv,[bi/2+l,bi+l]  (dX v\b1/2,b1/2+l]—dCv^1/2,b1/2+l})  d£v,[ 0,6i/2]  (47) 

kb1  {^dJ~v^bi,bi-\-i]  ^'ti,[6i,6i+l])  dCv^y o,fei] 
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where  ko,  kbl/ 2,  and  kbl  are  DE-independent  constants  of  integration.  In  order  to  simplify  the  sum 
in  equation  47,  dC  on  the  requisite  intervals  is  approximated  to  be  either  dCh  or  dCp  depending  on 
the  interval  length.  That  is, 


and 


t^t),[0,6i/2]  —  d£v,[l,bi/2+l]  —  dCv,[b!/2,W-l]  ~  dCV,[b1/2+l,W]  ~  dCh 


(48) 


dCv\ 0,1]  —  dCv,[bi/ 2,6i/2+i]  —  dCV:[w-i,w]  —  dCp 


(49) 


The  matrix  dCh  is  computed  simply  averaging  the  derivative  of  coarse  operators  run  on  an  intervals 
of  size  &i/2.  The  matrix  dCp  is  approximated  by: 

dCp  =  ( dCh )  6i  =  exp  In  (dCh)^j  (50) 

computed  by  a  Taylor  expansion.  The  Taylor  expansion  is  truncated  based  on  the  number  of  matrix 
multiplications  required  in  the  computation  of  Ml  which  is  3  in  this  case.  It  is  also  assumed  that 
dCp  is  a  close  enough  approximation  to  the  true  value  that  the  commutation  error  of  dCp  and  dCh 
can  be  ignored.  An  improvement  upon  the  the  coarse  operator  dC  is  then  given  by: 


duiPl^y ;t;+i]  d£v,[o,w\  ~h  -^^-v,[o,w] 


(51) 


The  complete  algorithm  is  summarized  in  Algorithm  3. 


Algorithm  3  :  TP  Implementation  of  the  mPl  scheme 

Step  0:  [Macroscale,  Serial]  Solve  for  the  coarse  solution  and  set  to  for  u  =  0  to  u  <  max  iterations 
do 

Step  1:  [Microscale,  Parallel] 

1.1  Compute  x°u,for  all  w  by  interpolating  the  coarse  solution 

1.2  Compute  w,  W+1 )  and  for  all  w 

1.3  Compute  (0)  by  serial  linear  operations  as  in  equation  36 

1.4  Compute  dCv^hl/2]  and  dCVt[bl/2+i,W] 

l.-j  ddh  2  (^'■u,[o,&i/2] -\-dCv,\bl/2+i, w\) 

1.6  Compute  dCp  by  Taylor  expanding  Eq.  (50)  to  order  3. 

1.7  dCVy[ o,w]  =  dChdCpdCh 

1.8  kodCh  dCh  dp [o. l j  T  kblj2dChdP ■u^bl/2,b\/2+i]  d-Ch 

+  kbl  dPv^  [fel  )bl+i]  dC  h  dCh  —  (ko+ kbl/2 + kbl )  dC  h  dCp  dC  h 

1.9  drnPlv^0  W]  =  dCv^0^+Mlv  ^  w^ 

Step  2:  [Macroscale,  Serial] 

2.1  mv+ 1  =  dmPl[V}V+1]mv  +  n^+i]  (0) 

2.2  x“+1  =  Xy  +  mv 

enddo 


Scheme  6  (mP2):  Second  Multi-Propagator  Approximation 

A  second  iteration  on  dC  such  that 

dmP2v[0W]  =  dCVj[  0,w]  +  M  lv,[o,W]  +  ^^v,[o,w]  (52) 
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where  M2  can  be  computed  as: 


W—l  j- 1 

M2v,[o  ,w]  =  EE  dCv,[j+i,wy 

j= 1  i= o 


(dJ~v  \j  j-^i]  d,Cv  \j  j_\_ i])  dC-i+j+i j]  ■ 
(^*^u,[i,j+l]  dCv,[i,i+ 1])  ^'D,[0,i] 


(53) 


This  sum  is  computed  with  a  discrete  Boole’s  method  at  the  following  (i,j)  values,  where  62  =  W— 2: 


(0,1) 

(0,  362/4  +  i) 
(62/4,62/2  +  1) 
(62/2,62/2  +  1) 
(362/4,  362/4  +  1) 


(0,62/4+1) 
(0,62  +  1) 
(62/4,362/4  +  1) 
(62/2,362/4  +  1) 
(362/4,62  +  1) 


(0,62/2  +  1) 

(62  /4,62  /4  +  1) 
(62/4,62  +  1) 
(62/2,  62  +  1) 
and  (62,  62  +  1) 


Scheme  7  (hAl):  First  Higher  Accuracy  Approximation 

Many  ways  of  improving  the  accuracy  and  stability  of  PDEs  do  not  directly  apply  to  the  TP  PDE. 
The  discretization  in  Scheme  1  does  have  a  forward  Euler  resemblance,  but  the  propagator  F  can 
be  considered  to  contain  arbitrary  accuracy  in  the  v  variable,  and  the  derivative  with  respect  to  u 
is  already  well  conditioned  with  step  size  du  =  1.  The  basis  for  this  higher  accuracy  method  is  an 
approximation  to  the  midpoint  discretization  for  the  derivative  of  the  propagator  with  respect  to  u. 


1  dr 

2  dx 


mv  + 

,w 


1  dr 

2  dx 


mv  +  r(x^w,x^w+ 1)  =  0 

~.u 

^v,w 


(54) 


The  approach  is  similar  to  the  scheme  5  (mPl)  except  that  du  is  the  initial  approximation  rather 
than  the  the  derivative  of  a  coarse  propagator. 


Xv,[Q,w\  dv,[0,w] 

This  matrix  is  then  updated  to  approximate  du+ 1 , 

M]-v,[w,W+\]  —  ^v,[w,w  +  l]-^"ll>,[0,ut]  ~  XV,[0,W  +  1]  T 

where  Ml  can  be  directly  computed  as: 

W-i 

M1  V,[0,W]  =  ^2  dv,[i+l,W]  1]  “  dv,[i,i+ 1])  dv,[0,i\ 

1=0 


(55) 


(56) 


(57) 


The  sum  is  approximated  by  evaluating  i  at  3  points. 
MlVj[o,W]  pa^0^j[il61+i]  —  ^,[0,1]) 

+  ^i/2<,[6l/2+l,61+l]-  {dvtbi 


—  Au  \  rlu 

[6l/2,6i/2+l]  [6i/2,6i/2+l]  J  [0,6i/2] 


(C^i+r+ll  -<[6!+!+!])  <[0+!] 


The  approximation  to  half  of  du  plus  half  of  du+l  is  then  given  by: 


(58) 


dhAlvjpW}  —  0,1+]  +  2  M^v,[o,w] 


(59) 
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Algorithm  4  :  TP  Implementation  of  the  liAl  scheme 


Step  0:  [Macroscale,  Serial]  Solve  for  the  coarse  solution  and  set  to 
for  u  =  0  to  u  <  max  iterations  do 
Step  1:  [Microscale,  Parallel] 

1.1  Compute  x°u,for  all  w  by  interpolating  the  coarse  solution 

1.2  Compute  r(x%w,x%w+1)  and  fr  I  for  all  w 

1.3  Compute  (0)  by  serial  linear  operations  as  in  equation  36 

1.4  Compute  d^.u  t-n/2\  and  (F;  j  /2+ 1  6,]  by  matrix  multiplication 

I  r  fjU  _  ju  JU 

10  “u,[0,6i/2]  _  “u,[l,6i/2]au,[0,l] 

;u  _  ju  ju 

%,[l,6i/2+l]  ^  %,[hi/2,6i/2+l]ar,[l,b1/2] 

%,  6i/2+l,6i+l]  “  av, [61+1, 61]%, [61/2+1,6!] 
av, [61/2,61]  “  %, [61/2+1,6!]%, [61/2, 6i/2+l] 

16  ^,[1,61  +  1]  =  ^,[61/2+1,61+1]^, [l,6i/2+l] 

JU  _  JU  JU 

%,[0,6i]  -  %,[6i/2, 61]%, [0,6i/2] 

17+  —  ju  r/u  ju 

II  Uv,  [0,61  +  1]  —  Uv,  [61/2+1,61+1]%,  [61/2, 61/2+1]%,  [0,61/2] 

Step  2:  [Macroscale-Microparallel] 

2.1  Compute  at  at  i  =  0,  i  =  b\/2,  and  i  =  b\. 

2.2  Compute:  , 

n  k0  kbi/2  kbi  \,u  1  k0  ji 

rriv+i  =(!  -  — - j - 2“)rf[„, „+!]"»,;  +  yd. 


JU+l 

xu,[l,6i+l]%,[0,l] 


mr. 


+ 


+ 


2 

2 
fcbi 
2 


ju  JU+1  rlu  m 

llv, [6i/2+l, 6i+l]au, [6i/2, bi/2+l]Uv, [0,6i/2]  ,nv 


^blM+l]dv,[0,bl]mv  +  nb,«+l](°) 


2.3  x“+1  = 


enddo 
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Unlike  Schemes  1,  2,  4,  5,  and  6,  this  scheme  requires  additional  matrix  vector  multiplications  for  the 
macroscale  step  due  to  the  fact  that  1"A+1j ,  computed  at  three  values  of  i  cannot  be  precomputed 
in  the  macro-parallel  step.  The  complete  algorithm  is  summarized  in  Algorithm  4 

Scheme  8  (hA2):  Second  Higher  Accuracy  Approximation 

A  closer  approximation  to  Equation  (54)  can  be  obtained  by  considering  a  second  iteration  to  the 
approximation  of  du+l .  An  approximation  to  the  average  of  du  and  du+l,  dhA2,  is  given  by: 

dhA2V}[0jWj  =  d%,[o,w]  +  2M^v’[°’W}  +  9  M^v,[o,w]  (60) 


where 


W-ij- 1 

M2v,[o ,w]  =  E  E  dCv,\j+i,w] 

j= 1  *= 0 


{dJ7v,\j,j+ 1]  d£v,\j 


d+i]) 


.[*+i  j] 


(dFv 


[i,i+l]  d^-'v,{i,i+l])  dCVt[0A 


(61) 


Additionally,  a  better  approximation  to  n^)U+1]  can  be  found,  is  computed  based  only 

on  the  values  of  du  where  as  nhA2\v  v+1-i  incorporates  the  difference  between  dhAl  and  du  on  the 
microscale. 

nhA2Vt[  0,w]  =  nVj[  0,w)  +  (62) 

where  ml  is  defined  by  the  equation 

{dhA\v^w,w+\\  ^  )  Tn^-v,[o,w]  F  t(xV  WiXv  w_^_ i)  (63) 

which  is  equal  to: 


m^v,[0,W] 


W-l  j 

EE  dv,\j+l,W) 

j= 1  *=o 


\j,j+ !]  dV,\j,3  + 1] 


d 


U 


r{x^,x 


U  \ 
v,i-\- 1) 


(64) 


The  TP  implementation  of  this  scheme  is  similar  to  that  of  Algorithm  4  but  with  15  (■ i,j )  locations 
evaluated  instead  of  3  i  locations  and  a  matrix  multiplied  by  a  vector  up  to  5  times  serially  instead 
of  3. 

Scheme  9  (mPl-hAl):  First  Multi-Propagator  and  Higher  Accuracy  Approximation 

This  scheme  is  similar  to  the  hAl  scheme,  but  rather  than  compute  du on  [1,  b\/2]  and  [&i/2  +  1,  b\], 
the  approximation  to  du  given  by  dmP  1,  as  described  in  Algorithm  3,  is  employed  on  each  of  these 
two  subregions  in  order  to  reduce  the  number  of  required  matrix  multiplications. 

Scheme  10  (mP2-hA2):  Second  Multi-Propagator  and  Higher  Accuracy  Approximation 

For  this  scheme,  dm,P2  is  computed  on  [1,62/4],  [62/4+1,62/2],  [62/2  +  1,362/4]  ,  and  [362/4+1,62] 
rather  than  du .  As  with  the  mPl-hAl  scheme,  this  is  done  to  reduce  computational  cost. 

3.3  TP  Implementation  of  A  Serial  Implicit  Scheme 

The  stability  of  the  serial  propagators  requires  that  they  be  bounded  in  H 1  in  the  usual  sense. 
Classical  stability  analysis  (Von  Neumann  stability  analysis)  assumes  linearity,  in  which  case  the 
serial  solution  is  the  obtained  from  the  repeated  application  of  ^  and  that  the  magnitude  of  this 
operator  is  less  than  1.  The  stability  of  the  serial  schemes  carries  over  to  the  TP  schemes  that 
only  depend  on  the  coarse  propagator  derivative  or  only  depend  on  the  fine  propagator  derivative. 
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But  instability  may  occur  for  the  multipropagator  schemes  5  and  6  should  the  difference  cj^  —  ^ 
become  relatively  large.  Additionally,  the  propagator  derivative  will  often  be  written  in  terms  of  a 
matrix  inverse  when  derived  from  implicit  serial  schemes.  For  both  of  the  aforementioned  reasons, 
we  describe  a  fully  TP  implicit  scheme. 

The  generic  form  of  the  propagator  in  Equation  (2)  could  be  considered  to  encapsulate  an  implicit 
serial  scheme.  However,  we  instead  explicitly  delineate  the  equation  as  a  time  reverse  propagation: 

F(xt+i)  -  xt  =  0  (65) 


For  scheme  1  (dJ7),  Equation  (31)  is  replaced  by: 

rriv+i  =  ( mv  -  F(x%+1)  +  (66) 

Similarly,  for  the  PA  scheme,  Equation  (41)  is  replaced  by: 

d[v,v+i\^iv+i  T  Ti[v,v-\-i]  (0))  (67) 


dF 

dx 


Furthermore,  a  matrix  equation  to  replace  Equation  (42)  is: 


Xt=dF\ \Xt+1Xt+1 


(68) 


which  is  the  same  equation  with  the  direction  of  time  reversed.  Therefore,  Ml  is  simply 

W-l 

M1V,[0 ,W]  'y  ]  dCv,[0 ,i]  (dF )-l]  | (69) 

i= 0 


and  M2  is 


W-ij-i 

M2vm  —  y  '  y  ^  dCv  ^ij  (dFv>[ij+ 1]  ^v,[i+ij]  f'7rq 

j= l  i= o  (79) 

(dFv , [j , jf + 1]  d^"v,\j,j+ 1])  dCv  \jjr i:\v] 

dmP  1  and  dmP 2  can  be  computed  by  Equations  (51)  and  (52)  respectively  and  can  replace  d^+i] 
in  Equation  (67). 

3.4  Accuracy  Considerations 

For  serial  propagators,  accuracy  can  usually  be  analyzed  rigorously  in  the  limit  as  the  time  step 
size  goes  to  zero  and  error  accumulates  at  each  serial  step  [4].  TP  methods  are  fundamentally  more 
challenging,  as  the  convergence  of  a  TP  method  depends  on  the  global  nature  of  the  solution.  Some 
of  the  challenges  of  an  accurate  TP  implementation  are  given  in  the  example  problems  of  Section 
4.  From  the  schemes  presented  here,  one  can  observe  how  well  each  incorporates  the  change  in  the 
propagator  along  the  path  of  the  solution.  For  schemes  2  (dC)  and  3  (Parareal),  the  approximate 
error  is: 


$(x“+1)||  >  o 


d2F 

dx2 


(71) 
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For  schemes  1  (dJ7),  4  (PA),  5  (mPl),  and  6  (mP2)  we  have 


|$(®: 


“+i)ii=»( 


d2T 


dx2 


(72) 


And  for  schemes  7  (hAl),  8  (hA2),  9  (mPl-hA2),  and  10  (mP2-hA2) 

d27 


|$(^+1)||  <o 


dx 2 


\x00-xu\ 


(73) 


4  Sample  TP  Results  and  Discussion 

4.1  Setup 

All  of  the  TP  methods  are  associated  with  a  fine  propagator,  F,  computed  based  on  a  microscale 
time  step  size  dw ,  as  well  as  a  coarse  propagator,  C,  computed  with  time  step  size  dv.  The  DEs 
are  initialized  with  coarse  propagators  for  several  different  schemes.  The  microscale  time  step  size 
dw  is  fixed  at  a  small  value  while  dv  is  varied  to  as  large  a  value  as  possible.  The  error  reduction 
after  the  first  TP  iteration  (Figures  2,  6,  and  10)  as  well  as  number  of  TP  iterations  required  for 
convergence  (Figures  3,  7,  11)  were  recorded.  The  sequential  solution  with  dw  sized  fine  time  steps 
was  considered  the  ’exact’  solution,  and  the  non-serial  methods  were  iterated  until  the  error  relative 
to  that  solution  at  the  last  time  point  was  less  than  one  tenth  the  error  from  considering  the  serial 
solution  with  2  dw  sized  fine  time  steps  in  serial.  This  balance  was  chosen  to  give  some  justification 
to  employing  the  dw  sized  time  steps  but  at  the  same  time  avoid  considering  the  region  of  potentially 
slow  convergence  that  comes  about  due  to  double  precision  floating  point  operation  error  in  all  of 
the  methods. 

The  DEs  were  chosen  to  have  parameters  with  vastly  different  scales  and  be  somewhat  challenging 
as  discussed  in  Section  4.2.  While  the  coarse  time  step  size,  dv,  is  varied,  particular  attention  is 
paid  to  the  case  when  the  coarse/fine  step  ratio  is  high.  The  initial  coarse  solution  must  still  be 
solved  for  serially,  and  therefore,  this  case  where  the  coarse  solution  is  as  coarse  as  possible  offers 
the  opportunity  for  the  highest  possible  level  of  parallelization.  In  addition  to  error  reduction  and 
convergence  rates,  particular  focus  on  failure  is  made.  Not  all  of  the  TP  schemes  succeed  for  all  of 
the  given  coarse  solutions.  The  failure  depends  both  on  the  robustness  of  the  TP  scheme  and  the 
global  nature  of  the  solution. 


4.2  Example  DEs 
Example  1.  Lotka-Volterra 

The  Lotka-Volterra  equation  with  a  quadratic  growth  term  is  given  by: 

dx  _  o 

—  =  ax  —  pxy  +  yx 
dt 

dy  2 

—  =  ryxy  -  ay  +  XV 

with 

a  =  10,p  =  .01,  7  =  .0001,  7]  =  .01,  cr  =  10,  x  =  -00002 

x(0)  =  1000,  y(0)  =  1000,  t  €  [0, 10] 


(74) 
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This  model  represents  the  population  of  two  interacting  species,  a  predator  and  prey.  The  quadratic 
term  causes  a  slow  acceleration  away  from  the  relatively  rapidly  rotating  population  sizes  (in  phase 
space)  and  gives  the  DE  a  multiscale  nature.  The  serial  coarse  propagator  is  forward  Euler  and  the 
serial  fine  propagator  is  the  second  order  Runge-Kutta  scheme. 


xt+ 1  =  T(xt)  ->xt+i  =  xt  +  dtf(xt)  Coarse 

xt+i  =  T(xt)  ->xt+ 1  =  xt  +  dtf(xt  +  y/(a;t))  Fine 


(75) 


The  coarse  step  number  was  varied,  resulting  in  differing  amounts  of  accuracy  in  the  initial  coarse 
solution,  and  the  fine  step  number  was  fixed  at  approximately  220  steps  with  rounding  as  necessary 
for  each  TP  scheme.  Figure  1  plots  in  phase  space  the  difference  between  this  coarsest  possible  so¬ 
lution  and  the  fine  solution.  At  this  level  of  coarseness  the  two  solutions  begin  to  differ  qualitatively 
due  to  the  coarse  solutions  inability  to  accurately  track  small  scale  nonlinearities  over  large  times. 

Example  2.  Pendulum  Equation 

The  nonlinear  pendulum  equation  is  given  by: 


dx 


dy  —a  sin(/3v/x)) 
dt  yfx 


(76) 


with 

a  =  1,  p  =  1,  x(0)  =  0,  y(0)  =  .5  ,t€  [0,  2000] 

The  solutions  are  nearly  circular  (like  the  classic  pendulum  approximation)  but  the  significant  non¬ 
linearity  of  the  DE  can  lead  to  large  solution  error  during  the  course  of  TP  implementation.  The 
serial  propagator  for  both  coarse  and  fine  solutions  is  the  Leapfrog  scheme 


xt+i  =  T(xt)  -t 


xt+i  =  xt  +  dtyt  +  ^a(xt) 

2/m  =2 h  +  f  (a(xt)  +  a(xt+i)) 


(77) 


The  fine  step  number  is  fixed  near  220  time  steps  and  the  coarse  step  number  is  varied.  The  serial 
Propagator  is  symplectic  and  a  high  degree  of  energy  conservation  is  maintained  for  both  the  fine 
and  coarsest  solution  as  shown  in  Figure  5.  In  contrast  to  the  Lotka-Volterra  example,  the  coarse 
and  fine  solutions  are  qualitatively  similar,  but  this  fact  alone  is  insufficient  to  guarantee  that  all 
TP  schemes  will  converge  quickly  to  the  fine  solution  as  discussed  in  Section  4.3. 

Example  3.  Stiff  non-linear  equation 

An  equation  that  is  stiff,  nonlinear,  and  involving  multiple  scales,  to  provide  a  challenging  imple¬ 
mentation,  is  given  by  : 

=  —~(x  +  /3x3)  +  y2e~l3y 

f  1  (78) 

J  e-X~P  (l  +y)+ye-Py 

at  a 

with 

a  =  1000,  p  =  .1,  7  =  .001  s(0)  =  1,  2/(0)  =  0  ,t€  [0, 10] 
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It  is  implemented  for  both  fine  and  coarse  solutions  with  the  backward  Euler  scheme  : 


Jr(xt+ 1)  =  xt  ->■  xt+i  -  dtf(xt+ 1)  =  xt  (79) 

The  solution  solution  has  215  fine  time  steps.  Stiff  DEs  offer  particular  challenge  in  the  serial  case 
and  additional  difficulties  in  the  TP  case  as  discussed  in  Section  4.3.  Figure9  shows  phase  space 
plots  of  the  most  accurate  solution  to  the  DE  and  the  coarsest  approximation. 

4.3  TP  Method  Comparison  Discussion 

4.3.1  Coarse  Approximation  Scheme  Results 

Scheme  2  (dC)  and  scheme  3  (Parareal)  performed  similarly  for  the  cases  where  both  schemes  were 
able  to  converge.  For  both  of  these  schemes,  the  error  of  the  TP  iteration  is  dominated  by  the 
difference  between  the  first  derivative  of  the  coarse  propagator  and  the  first  derivative  of  the  fine 
propagator.  The  Parareal  method  is  more  robust  to  the  large  nonlinearity  present  in  the  pendulum 
DE  and  is  able  to  converge  to  the  fine  solution  in  cases  where  the  d C  scheme  is  not  able  to  converge. 
However,  the  number  of  iterations  required  for  convergence  becomes  quite  large  as  shown  in  Figure 
7.  Schemes  2  and  3  require  more  iterations  to  converge  when  compared  to  other  schemes  and 
this  distinction  in  the  rate  of  convergence  expands  as  the  coarse  to  fine  step  size  ratio  expands. 
Additionally,  these  two  schemes  do  not  converge  at  all  when  the  coarse  to  fine  step  size  ratio 
becomes  large  enough. 

4.3.2  Multi-Propogator  Scheme  Results 

Scheme  1  (dJ7)  performed  nearly  identically  to  the  PA  scheme  (Scheme  4).  Scheme  1  performs 
non-linear  operations  on  the  microscale,  where  as  the  PA  method  relies  on  linear  approximations 
at  the  microscale.  It  does  not  appear  that  the  microscale  non-linear  computations  are  meaningfully 
incorporated  into  the  TP  update  to  improve  its  accuracy  in  the  general  case. 

Schemes  5  (mPl)  and  6  (mP2)  are  reasonable  approximations  to  the  PA  scheme.  This  is  demon¬ 
strated  in  the  error  reduction  and  convergence  results  of  the  Lotka-Volterra  and  pendulum  exam¬ 
ples.  The  approximation  from  the  mPl  scheme  is  better  than  the  dC  scheme  (scheme  2)  but  slightly 
worse  than  the  PA  scheme  while  the  more  accurate  mP2  scheme  provides  results  very  close  to  the 
PA  scheme. 

For  the  stiff  DE,  the  dF  and  PA  schemes  fail  due  to  the  linear  operators  becoming  un-invertible 
to  machine  precision.  The  mP2  scheme  fails  for  the  same  reason.  Only  the  mPl  scheme  is  able  to 
succeed  with  the  coarsest  initial  solution  due  to  it  being  more  accurate  than  the  dC  scheme  yet  not 
so  accurate  as  to  involve  a  machine  precision  un-invertible  linear  operator. 

4.3.3  Higher  Accuracy  Scheme  Results 

The  higher  accuracy  schemes  7  (hAl),  8  (hA2),  9  (mPl-hAl)  and  10  (mP2-hA2)  provide  greater 
error  reduction  after  the  first  iteration,  faster  convergence  when  the  initial  coarse  and  fine  solutions 
disagree  to  a  significant  degree,  and  are  able  to  succeeded  in  cases  where  other  TP  methods  fail. 
However,  updating  the  solution  in  this  way  does  require  several  additional  matrix  vector  multipli¬ 
cations  at  each  macroscale  time  step. 

4.4  3.4  GPU  Computing  Results 

Six  of  the  TP  schemes  were  implemented  using  a  single  CPU  and  GPU.  The  microscale/parallel  step 
in  Algorithms  1,  2,  3,  and  4  was  performed  on  the  GPU  with  each  thread  given  one  value  of  v.  The 
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macroscale/serial  step  was  performed  in  serial  on  the  CPU.  While  a  higher  degree  of  parallelization 
is  possible  for  macroscale/serial  step  in  the  case  of  larger  systems  and  more  computational  capacity 
(the  operations  are  simply  matrix  multiplication  and  vector  addition),  even  this  limited  paralleliza¬ 
tion  can  demonstrate  some  of  the  main  factors  that  determine  the  computing  time.  As  discussed  in 
Section  2.1,  the  number  of  serial  computations  should  be  limited  as  much  as  possible,  which,  in  the 
context  of  the  schemes  described  in  Section  3.2,  means  that  the  macro-step  size,U,  should  be  small. 
For  a  fixed  total  fine  time  step  number  that  is  the  product  of  V  and  W.  a  small  V  is  obtained  by 
making  the  coarse  to  fine  step  ratio,  W.  large.  Figures  4  and  8  show  that  the  TP  iterations  are 
faster  as  V  decreases.  The  speed  up  is  limited  in  this  case  because  order  W  serial  computations  are 
performed  on  each  thread.  However,  this  microscale  serialization  on  each  macroscale  [u,  v  +  1]  block 
can  be  treated  as  a  subproblem  in  which  another  layer  of  TP  iterations  can  be  performed  should 
the  computing  capacity  be  available. 

The  principal  difference  amongst  the  TP  schemes  was  that  those  that  produced  the  least  solution 
error,  dJ7,  PA,  mPl,  and  hAl,  took  roughly  two  to  three  times  longer.  This  is  mainly  due  to  the 
fact  that  on  the  microscale,  the  Parareal  and  dC  schemes  require  only  the  calculation  of  the  fine 
propagator  F(xv,w),  where  as  the  others  require  computations  of  both  the  fine  propagator  and  its 
derivative.  The  hAl  scheme  also  requires  additional  macroscale  calculations  and  macroscale  data 
transfers  as  described  in  Section  3.2.  This  results  in  a  higher  computational  cost  when  W  is  small 
in  comparison  to  the  other  schemes.  Whether  the  additional  time  cost  of  each  TP  iteration  is 
worthwhile  is  highly  problem  dependent  but  in  some  cases  may  result  in  faster  convergence  or  an 
orders  of  magnitude  error  reduction,  as  discussed  in  Section  4.3. 

We  hope  to  further  examine  TP  computing  speed  increases  in  the  context  of  larger  scale  ODEs 
and  PDEs  in  the  near  future.  In  particular,  the  advantages  of  the  mPl  scheme  is  not  revealed  in 
the  present  analysis  because  of  the  small  system  size.  But  all  of  the  schemes,  except  the  Parareal, 
may  show  additional  speed  increases  when  the  linear  operations  in  the  macroscale  step  can  be 
parallelized. 

5  Applicability  Considerations 

The  sample  results  provide  some  indication  of  the  advantages  and  limitations  of  the  various  TP 
schemes.  Many  DEs  can  be  solved  perfectly  well  in  a  reasonable  amount  of  time  with  established 
serial  methods.  Approaches  to  solving  these  equations  faster  with  significant  parallel  computing 
capacity  has  emerged  as  a  challenging  research  area  in  its  own  right  with  practical  applications. 
Other  DEs  due  to  their  size  or  multiscale  nature,  cannot  be  solved  reasonably  using  classical  serial 
computations. 

We  hope  to  apply  these  TP  schemes  to  more  challenging  DEs  in  the  future  where  we  would  expect 
a  magnification  of  the  TP  performance  differences  observed  here  as  the  coarse  and  fine  solutions 
diverge  more  drastically  and  more  parallel  computing  capacity  can  be  applied.  The  solutions  to 
these  more  challenging  problems  will  often  exhibit  a  wide  range  of  complicated  behavior  such  as 
shocks,  instabilities,  high  dimensionality,  and  a  complex  multiscale  nature.  The  methodology  and 
results  presented  here  gives  some  reason  for  optimism  to  the  prospect  of  TP  methods  that  can  be 
arbitrarily  improved  to  fit  the  challenge  of  at  least  some  of  the  more  problematic  PDEs  of  widespread 
interest  and  will  be  the  subject  of  future  work. 
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Figure  1:  A  phase  space  plot  of  the  Lotka-Volterra  example  given  in  Equation  74.  The  fine  solution 
is  black  and  the  coarsest  solution  provided  as  an  input  to  each  of  the  TP  schemes  is  in  cyan. 
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Error  Reduction  Results  for  the  Lotka 
Volterra  ODE 
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Figure  2:  The  error  reduction  after  one  TP  iteration  is  given  for  each  of  the  ten  schemes  presented 
in  Section  3.2  for  the  Lotka- Volterra  example  given  in  Equation  74. 


Convergence  Results  for  the  Lotka 
Volterra  ODE 
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Figure  3:  The  number  of  TP  iterations  until  approximate  convergence  (i.e.  when  the  error  is  below 
one  tenth  the  error  of  a  run  with  twice  the  step  size)  is  shown  for  each  of  the  ten  schemes  presented 
in  Section  3.2  for  the  Lotka- Volterra  example  given  in  Equation  74. 
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Computation  Time  Results  for  the  Lotka 
Volterra  ODE 
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Figure  4:  The  computation  time  (“wall-clock”)  for  a  single  TP  iteration  was  computed  for  six 
different  TP  schemes  for  the  Lotka- Volterra  example.  The  parallel  computations  were  performed  on 
a  single  GPU  and  the  serial  compuations  were  performed  on  a  single  CPU.  In  general,  TP  iterations 
that  generated  less  solution  error  required  modestly  higher  computation  cost. 


Figure  5:  A  phase  space  plot  of  the  pendulum  example  given  in  Equation  76.  The  fine  solution  is 
black  and  the  coarsest  solution  provided  as  an  input  to  each  of  the  TP  schemes  is  in  cyan. 
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Error  Reduction  Results  for  the  Pendulum 
ODE 
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Figure  6:  The  error  reduction  after  one  TP  iteration  is  given  for  each  of  the  ten  schemes  presented 
in  Section  3.2  for  the  pendulum  example  given  in  Equation  76. 


Convergence  Results  for  the 
Pendulum  ODE 
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Figure  7:  The  number  of  TP  iterations  until  approximate  convergence  (i.e.  when  the  error  is  below 
one  tenth  the  error  of  a  run  with  twice  the  step  size)  is  shown  for  each  of  the  ten  schemes  presented 
in  Section  3.2  for  the  pendulum  example  given  in  Equation  76. 
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Computation  Time  Results  for  the 
Pendulum  ODE 
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Figure  8:  The  computation  time  (“wall-clock”)  for  a  single  TP  iteration  was  computed  for  six 
different  TP  schemes  for  the  pendulum  example.  The  parallel  computations  were  performed  on  a 
single  GPU  and  the  serial  compilations  were  performed  on  a  single  CPU.  In  general,  TP  iterations 
that  generated  less  solution  error  required  modestly  higher  computation  cost. 


Figure  9:  A  phase  space  plot  of  the  stiff  example  given  in  Equation  78.  The  fine  solution  is  black 
and  the  coarsest  solution  provided  as  an  input  to  each  of  the  TP  schemes  is  in  cyan. 
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Convergence  Results  for  the 
Nonlinear  Stiff  ODE 
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Figure  10:  The  error  reduction  after  one  TP  iteration  is  given  for  each  of  the  ten  schemes  presented 
in  Section  3.2  for  the  stiff  example  given  in  Equation  78. 


Error  Reduction  Results  for  the  Nonlinear 
Stiff  ODE 


Coarse  Step  Size  /  Fine  Step  Size 

0  500  1000  1500  2000  2500 


Parareal 

mPl 

mP2 


Figure  11:  The  number  of  TP  iterations  until  approximate  convergence  (i.e.  when  the  error  is  below 
one  tenth  the  error  of  a  run  with  twice  the  step  size)  is  shown  for  each  of  the  ten  schemes  presented 
in  Section  3.2  for  the  stiff  example  given  in  Equation  78. 
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